@srielau
Contributor

@srielau srielau commented Oct 28, 2025

What changes were proposed in this pull request?

We propose expanding the IDENTIFIER() clause, which turns a string into a qualified identifier, to every place an identifier can appear. The current clause is severely limited in where it can be used because it accepts constant expressions, including session variables.
Because of the complexity of that argument, the existing clause requires tricky code to incrementally analyze its arguments and then execute sections of parser code at a later point.
By contrast, the generalized IDENTIFIER clause allows only string literals, which can be processed in the visitor methods.
Thanks to the rework of parameter markers and string coalescing, this allows constructs such as:

SELECT * FROM IDENTIFIER(:cat '.' :schema '.' :table)

It even allows:

SELECT 'hello' AS IDENTIFIER(:alias);

This is really all IDENTIFIER() needs. We may be able to deprecate and de-support the existing, overly complex IDENTIFIER() implementation.
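
To make the folding concrete, here is a minimal sketch in Scala. It is not Spark's implementation: it assumes the parameter markers have already been replaced by string literals, the names `IdentifierArgSketch`, `coalesce`, and `nameParts` are hypothetical, and backtick-quoted name parts are deliberately ignored. It concatenates the adjacent string literals of an IDENTIFIER() argument and splits the result on dots to obtain a qualified name.

```scala
object IdentifierArgSketch {
  // Matches one single-quoted SQL string literal; '' inside a literal is an escaped quote.
  private val StringLiteral = """'((?:[^']|'')*)'""".r

  // Concatenate the contents of all string literals in the argument, left to right.
  def coalesce(arg: String): String =
    StringLiteral.findAllMatchIn(arg).map(_.group(1).replace("''", "'")).mkString

  // Split the coalesced text into name parts.
  // Assumption: no backtick-quoted parts, so every dot is a separator.
  def nameParts(arg: String): List[String] =
    coalesce(arg).split('.').toList

  def main(args: Array[String]): Unit = {
    // The argument of IDENTIFIER(:cat '.' :schema '.' :table) once markers are literals.
    println(nameParts("'main' '.' 'sales' '.' 'orders'").mkString(" / "))
    // The argument of IDENTIFIER(:alias) once the marker is a literal.
    println(nameParts("'alias'").mkString(" / "))
  }
}
```

Because the argument is nothing more than string literals, this kind of folding needs no expression analysis, which is the simplification the description argues for.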

Why are the changes needed?

IDENTIFIER() is a popular feature, but it can only be used in very specific, hard-to-reason-about places.
The new implementation preserves 99% of the functionality while expanding its use to everywhere.

Does this PR introduce any user-facing change?

Yes, it's a new feature

How was this patch tested?

Expanded the Parameters and identifier-clause test suites.

Was this patch authored or co-authored using generative AI tooling?

Yes, Claude Sonnet 4.5

@github-actions github-actions bot added the SQL label Oct 28, 2025
@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 28, 2025 18:26
… maintainability

- Extract IDENTIFIER_PREFIX constant for magic string
- Improve getMultipartIdentifierText documentation with complete Scaladoc
- Narrow exception handling from Exception to ParseException
- Remove redundant ParserUtils prefix in class methods
- Fix qualified label validation to check resolved identifiers
- Ensure all comments are complete sentences ending with periods
- Remove dead code and improve variable naming
- Fix FOR loop variable resolution to use getMultipartIdentifierText

All tests pass:
- SqlScriptingParserSuite (qualified label validation)
- SqlScriptingE2eSuite (identifier tests)
- SQLQueryTestSuite (identifier-clause and identifier-clause-legacy)
@srielau srielau marked this pull request as ready for review November 6, 2025 05:15
@srielau srielau changed the title from "[DRAFT] [SPARK-53573][SQL] IDENTIFIER everywhere" to "[SPARK-53573][SQL] IDENTIFIER everywhere" Nov 6, 2025
Contributor

@dtenedor dtenedor left a comment

Did an initial round of review. Thanks for working on this!

@cloud-fan
Contributor

thanks, merging to master/4.1!

@cloud-fan cloud-fan closed this in daa29f6 Nov 13, 2025
cloud-fan pushed a commit that referenced this pull request Nov 13, 2025
Closes #52765 from srielau/identifier-lite.

Authored-by: Serge Rielau <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit daa29f6)
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun
Member

dongjoon-hyun commented Nov 13, 2025

Hi, @srielau and @cloud-fan.

Is this a mistake? Or did you want to make a huge follow-up? At first glance, the JIRA issue title looks weird to me.

$ git log --oneline | grep SPARK-53573
daa29f6ede2 [SPARK-53573][SQL] IDENTIFIER everywhere
4cf7772a84f [SPARK-53573][SQL] Allow coalescing string literals everywhere
983d384222d [SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 13, 2025

Especially, this PR has 8k lines. It's a little hard for a PR of 8k lines to be a follow-up.

@srielau
Contributor Author

srielau commented Nov 13, 2025

> Hi, @srielau and @cloud-fan.
>
> Is this a mistake? Or did you want to make a huge follow-up? At first glance, the JIRA issue title looks weird to me.
>
> $ git log --oneline | grep SPARK-53573
> daa29f6ede2 [SPARK-53573][SQL] IDENTIFIER everywhere
> 4cf7772a84f [SPARK-53573][SQL] Allow coalescing string literals everywhere
> 983d384222d [SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling

This was intentional, albeit perhaps improper procedure.
These three PRs go together. The goal is to support parameterizing literals and identifiers everywhere.

  1. Allow parameters everywhere by moving substitution to a pre-parser.
  2. Expand string literal coalescing from constant to basic string literals.
  3. Fold IDENTIFIER of string literals during parsing.
    => We can now have
    spark.sql("SELECT 1 AS IDENTIFIER('C' :colordinal)", Map("colordinal" -> "1"))
    and
    spark.sql("CREATE TABLE ... LOCATION :root '/somedir'", Map("root" -> "/volA"))

I will work with @cloud-fan to structure it into subtasks.
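
For readers following the thread, the three steps can be sketched as plain text rewriting. This is a toy model under stated assumptions, not the code in these PRs: `ParameterPipelineSketch` and its methods are hypothetical names, the regular expressions stand in for a real lexer, and a `:name` sequence inside a string literal would be mishandled.

```scala
import scala.util.matching.Regex

object ParameterPipelineSketch {
  // A named parameter marker such as :root (naive: not aware of string literals).
  private val Marker = """:(\w+)""".r
  // Two string literals separated only by whitespace.
  private val AdjacentLiterals = """'((?:[^']|'')*)'\s+'((?:[^']|'')*)'""".r
  // IDENTIFIER applied to a single string literal.
  private val IdentifierCall = """(?i)IDENTIFIER\(\s*'((?:[^']|'')*)'\s*\)""".r

  // Step 1: substitute named markers with quoted string literals before parsing.
  def substitute(sql: String, params: Map[String, String]): String =
    Marker.replaceAllIn(sql, m =>
      Regex.quoteReplacement("'" + params(m.group(1)).replace("'", "''") + "'"))

  // Step 2: merge adjacent string literals, repeating until nothing changes.
  @scala.annotation.tailrec
  def coalesce(sql: String): String = {
    val merged = AdjacentLiterals.replaceAllIn(sql, m =>
      Regex.quoteReplacement("'" + m.group(1) + m.group(2) + "'"))
    if (merged == sql) sql else coalesce(merged)
  }

  // Step 3: fold IDENTIFIER('text') into the bare identifier text.
  def foldIdentifier(sql: String): String =
    IdentifierCall.replaceAllIn(sql, m =>
      Regex.quoteReplacement(m.group(1).replace("''", "'")))

  // The three steps in the order listed above.
  def rewrite(sql: String, params: Map[String, String]): String =
    foldIdentifier(coalesce(substitute(sql, params)))

  def main(args: Array[String]): Unit = {
    println(rewrite("SELECT 1 AS IDENTIFIER('C' :colordinal)", Map("colordinal" -> "1")))
    println(rewrite("LOCATION :root '/somedir'", Map("root" -> "/volA")))
  }
}
```

The ordering matters in this sketch: markers must become literals before coalescing can see them, and IDENTIFIER() can only be folded once its argument has collapsed to a single literal.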

* @see
* [[org.apache.spark.sql.catalyst.parser.AstBuilder]] for the full SQL statement parser
*
* ==CRITICAL: Extracting Identifier Names==
Member

Does this CRITICAL: mean line 72 ~ 73?

* '''DO NOT use ctx.getText() or ctx.identifier.getText()''' directly! These methods do not
* handle the IDENTIFIER('literal') syntax and will cause incorrect behavior.

Contributor Author

Yes. Ideally, I would have liked to block or override getText(), but I have not found a way to do so.

Member

Thank you. Got it~

* Example:
* {{{
* // WRONG - does not handle IDENTIFIER('literal'):
* val name = ctx.identifier.getText
Member

@dongjoon-hyun dongjoon-hyun Nov 13, 2025

For my understanding, is this always wrong? What about the currently remaining (existing) code in AstBuilder.scala and SparkSqlParser.scala like the following?

$ git grep ctx.identifier.getText
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala: * '''DO NOT use ctx.getText() or ctx.identifier.getText()''' directly! These methods do not
sql/api/src/main/scala/org/apache/spark/sql/catalyst/parser/DataTypeAstBuilder.scala: *   val name = ctx.identifier.getText
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:    val collationName = ctx.identifier.getText
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:      lazy val name: String = ctx.identifier.getText
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:      lazy val name: String = ctx.identifier.getText
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:        ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:      ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:    val indexName = ctx.identifier.getText
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala:        ctx.identifier.getText.toLowerCase(Locale.ROOT) match {
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala:        ctx.identifier.getText.toLowerCase(Locale.ROOT) match {

Contributor Author

@srielau srielau Nov 13, 2025

Yes! These slipped through the cracks. I must have fat-fingered my own grep to miss those.
I'll create a follow-up.
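
The pitfall discussed in this thread can be illustrated with a small stand-in model. Everything here is hypothetical: `IdentifierCtx` is not an ANTLR or Spark class, and `resolvedName` only gestures at what a helper like the `getMultipartIdentifierText` mentioned in the commit notes might do, since its real signature is not shown in this conversation. The sketch assumes that getText returns the raw matched source text.

```scala
object GetTextPitfallSketch {
  // Hypothetical stand-in for a parse-tree node; getText returns the raw matched text.
  final case class IdentifierCtx(rawText: String) {
    def getText: String = rawText
  }

  // IDENTIFIER applied to one string literal, matched case-insensitively.
  private val IdentifierCall = """(?i)IDENTIFIER\('((?:[^']|'')*)'\)""".r

  // Hypothetical resolver: unwrap IDENTIFIER('literal'), pass plain identifiers through.
  def resolvedName(ctx: IdentifierCtx): String = ctx.getText match {
    case IdentifierCall(literal) => literal.replace("''", "'")
    case plain => plain
  }

  def main(args: Array[String]): Unit = {
    val ctx = IdentifierCtx("IDENTIFIER('my_index')")
    println(ctx.getText)        // Raw text, still wrapped in IDENTIFIER(...).
    println(resolvedName(ctx))  // The name the user meant.
  }
}
```

In this model, any call site that reads getText directly would treat the whole `IDENTIFIER('my_index')` text as the name, which is why the remaining usages found by the grep above are worth a follow-up.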
